Abstract

The combined universal probability $M(D)$ of strings $x$ in sets $D$ is close to $\max_{x\in D} M(\{x\})$:
their ~logs differ by at most $D$'s information $j = I(D:\mathcal{H})$ about the halting sequence $\mathcal{H}$. Thus,
if all $x$ have complexity $K(x) > k$, $D$ carries $> i$ bits of information on each $x$, where $i+j \sim k$.
Note that there are no ways to generate $D$ with significant $I(D:\mathcal{H})$.

1 Introduction.

Many intellectual and computing tasks require guessing the hidden part of the environment from
available observations. In different fields these tasks have various names, such as Inductive Inference,
Extrapolation, Passive Learning, etc. The relevant part of the environment includes not only
what interests the guesser, but also all data that might influence the guess; it can be represented as
an often huge string $x \in \{0,1\}^*$. The known observations restrict it to a set $D \ni x$. $D$ is typically
enormous, and many situations allow replacing it with a much more concise theory representing
the relevant part of what is known about $x$. Yet, such approaches are ad hoc and secondary: raw
observations are anyway their ultimate source.

One popular approach to guessing, the "Occam Razor," tells us to focus on the simplest members
of $D$. (In words attributed to A. Einstein, "A conjecture should be made as simple as it can be,
but no simpler.") Its implementations vary: if two objects are close in simplicity, there may be
legitimate disagreements about which is slightly simpler. This ambiguity is reflected in the formalization
of "simplicity" via the Kolmogorov Complexity function $K(x)$, the length of the shortest prefix
program¹ generating $x$: $K$ is defined only up to an additive constant depending on the programming
language. This constant is small compared to the usually huge whole bit-length of $x$. More
mysterious is the justification of this Occam Razor principle.

A more revealing philosophy is based on the idea of a "Prior". It assumes the guessing of $x \in D$ is
done by restricting to $D$ an a priori probability distribution on $\{0,1\}^*$. Again, subjective differences
are reflected in ignoring moderate factors: say, in asymptotic terms, priors differing by $\Theta(1)$ factors
are treated as equivalent. The less we know about $x$ (before observations restricting $x$ to $D$), the
more "spread" is the prior, i.e. the smaller would be the variety of sets that can be ignored due
to their negligible probability. This means that distributions truly prior to any knowledge would
be the largest up to $\Theta(1)$ factors. Among enumerable (i.e. generated by randomized algorithms on
their outputs) distributions, such a largest prior does in fact exist and is $M(\{x\}) = 2^{-K(x)}$, up to a $\Theta(1)$ factor.
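
As a toy illustration of these definitions (not the universal algorithm used in this paper), one can take a small, hypothetical prefix-free table of programs and compute the probability each output receives together with the length of its shortest program. The table `TOY_MACHINE` and all function names below are assumptions made only for this sketch, and the sketch sums over all listed programs rather than reproducing the exact definition of $M$ given later.

```python
from fractions import Fraction

# Hypothetical prefix-free program table: program bits -> output string.
# This is NOT a universal machine; it only illustrates the definitions.
TOY_MACHINE = {
    "0": "a",
    "10": "b",
    "110": "a",
    "1110": "c",
    "1111": "b",
}

def is_prefix_free(programs):
    """True if no program is a proper prefix of another.

    In sorted order, any program having p as a prefix sits in a block
    starting right after p, so checking neighbours is enough.
    """
    progs = sorted(programs)
    return not any(q.startswith(p) for p, q in zip(progs, progs[1:]))

def toy_probability(machine):
    """Probability of x: sum of 2^-|p| over programs p that output x."""
    prob = {}
    for p, x in machine.items():
        prob[x] = prob.get(x, Fraction(0)) + Fraction(1, 2 ** len(p))
    return prob

def toy_complexity(machine):
    """Complexity of x: length of a shortest program that outputs x."""
    k = {}
    for p, x in machine.items():
        k[x] = min(k.get(x, len(p)), len(p))
    return k

M = toy_probability(TOY_MACHINE)
K = toy_complexity(TOY_MACHINE)
for x in sorted(M):
    # The shortest program alone already contributes 2^-K(x) to M({x}).
    print(x, "K =", K[x], "M =", M[x], "2^-K =", Fraction(1, 2 ** K[x]))
```

In this toy table the shortest program accounts for most of each output's probability; for a universal algorithm the two quantities agree only up to a constant factor.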



*An on-line version: http://arxiv.org/abs/1107.1458

†Computer Science department, 111 Cummington Street, Boston, MA 02215.
This research was partially supported by NSF grant CCF-1049505.

¹This analysis ignores issues of finding short programs efficiently. Limited-space versions of absolute complexity
results usually are straightforward. Time-limited versions often are not, due to difficulties of inverting one-way
functions. However, the inversion problems have time-optimal algorithms. See such discussions in [Levin 09].



These ideas, developed in [Solomonoff 64] and many subsequent papers, do remove some mystery
from the Occam Razor principle. Yet, they immediately yield a reservation: the simplest objects
each have the highest universal probability, but it may still be negligible compared to the combined
probability of complicated objects in $D$. This suggests that the general inference situation
may be much more obscure than the widely believed Occam Razor principle describes it.
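
The reservation can be made concrete with made-up numbers. The sketch below assumes a hypothetical set holding one simple member and very many complicated ones; the probabilities are invented for illustration and are not derived from any actual universal distribution.

```python
from fractions import Fraction
import math

# Hypothetical set D: one "simple" member and many "complicated" ones.
# All numbers are made up for illustration only.
simple_prob = Fraction(1, 2 ** 20)    # the simplest member, about 20 bits
complex_prob = Fraction(1, 2 ** 40)   # each complicated member, about 40 bits
complex_count = 2 ** 30               # number of complicated members in D

combined = simple_prob + complex_count * complex_prob
ratio = combined / simple_prob
gap_bits = math.log2(ratio)

print("combined / max =", ratio)
print("gap in bits    =", round(gap_bits, 4))
```

Here the simplest member carries under a thousandth of the set's combined probability, a gap of about 10 bits between the two logarithms.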

The present paper shows this could not happen, except as a purely mathematical construction.
Any such $D$ has high information $I(D:\mathcal{H})$ about the Halting Problem $\mathcal{H}$ ("Turing's Password" :-).
So, there are no ways to generate such $D$: see this informational version of the Church-Turing Thesis
discussed in the appendix of [Levin 10].

Consider finite sets $D$ containing only strings of high ($> k$) complexity. One way to find such
$D$ is to generate at random a small number of strings $x \in \{0,1\}^k$. With a little luck, all $x$ would
have high complexity, but $D$ would contain virtually all information about each of them.

Another (less realistic :-) method is to gain access to the halting problem sequence $\mathcal{H}$ and use
it to select for $D$ strings $x$ of complexity $\sim k$ from among all $k$-bit strings.
Then $D$ contains little information about most of its $x$ but much information about $\mathcal{H}$!

Yet another way is to combine both methods. Let $v_h$ be the set of all strings $vs$ with $K(vs) \sim
\|vs\| = \|v\|+h$. Then $K(x) \sim i+h$, $I(D:x) \sim i$, and $I(D:\mathcal{H}) \sim h$ for most $i$-bit $v$ and $x \in D = v_h$.
We show no $D$ can be better: they all contain strings of complexity $\lesssim \min_{x\in D} I(D:x) + I(D:\mathcal{H})$.

Our result is a follow-up to Theorem 2 in [Vereshchagin, Vitanyi 10]. We refer to Appendix 
I of [Vereshchagin, Vitanyi 04] for more history of the concepts we use and to [Kolmogorov 65, 
Solomonoff 64, Li, Vitanyi 08] for more material on Algorithmic Information Theory. This work 
was initiated by S. Epstein and the part he contributed to it also appears in [Epstein, Betke 11]. 



2 Complexity Tools and Conventions. 

$\|x\| = n$ for $x \in \{0,1\}^n$; for $a \in \mathbb{R}^+$, $\|a\| = \lceil |\log a| \rceil$. $S = \{0,1\}^*$. $(p0)^- = (p1)^- = p$; $(\emptyset)^-$ is undefined.
$[A] = 1$ if statement $A$ holds, else $[A] = 0$. $\prec f$, $\succ f$, $\asymp f$, and $\lesssim f$, $\gtrsim f$, $\sim f$ denote $< f+O(1)$,
$> f-O(1)$, $= f \pm O(1)$, and $< f+O(\|f+1\|)$, $> f-O(\|f+1\|)$, $= f \pm O(\|f+1\|)$, respectively.
$Q(G)$ is the probability of a set $G$ or the mean $\sum_x Q(\{x\})\,G(x)$ of a function $G$ by a distribution $Q$.

We use a prefix algorithm $U$: $U(p) = x$ iff $U(p0) = U(p1) = x$. Auxiliary inputs $y$ in $U_y$ are not so
restricted. $p$ is total if $U$ halts on all $k$-bit $ps$ for some $k$. Our $U$ is universal, i.e. minimizes (up
to $\asymp$) complexities $K$, $\|M\|$ below, and left-total: if $U(p1s)$ halts, $p0$ is total. $\mathcal{H}(i) = [U(i) \text{ halts}]$.

Kolmogorov complexity $K(x|y)$ is $\min_p \{\|p\| : U_y(p) = x\}$. Universal probability $M_v(G)$
is $\sum_p 2^{-\|p\|}\,[U(v(p^-)) \ne U(vp) \in G]$. We omit empty $|y$, $v$. $\|M(\{x\})\| \asymp K(x)$. $\Lambda(d) = \|d\| + K(\|d\|)$.
Information $I(x:y) = K(x) + K(y) - K(x,y) \asymp K(x) - K(x|(y,K(y)))$. $I(x:\mathcal{H}) = K(x) - K(x|\mathcal{H})$.
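
Since $K$ is not computable, no program can evaluate $I(x:y)$ exactly. Purely as an illustration of the formula $K(x)+K(y)-K(x,y)$, the sketch below substitutes the compressed length produced by `zlib` for $K$ and codes the pair $(x,y)$ as the concatenation of the two strings. Both substitutions are assumptions of this sketch, not part of the paper, and the resulting numbers are only a crude stand-in.

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed length in bits: a crude, computable stand-in for K."""
    return 8 * len(zlib.compress(data, 9))

def info_proxy(x: bytes, y: bytes) -> int:
    """Proxy for I(x:y) = K(x) + K(y) - K(x,y), with the pair coded as x + y."""
    return c(x) + c(y) - c(x + y)

rng = random.Random(0)  # fixed seed so the run is reproducible
x = bytes(rng.getrandbits(8) for _ in range(2000))
z = bytes(rng.getrandbits(8) for _ in range(2000))

# A string shares almost all of its information with itself,
# and almost none with an independently generated string.
print("proxy I(x:x) =", info_proxy(x, x))
print("proxy I(x:z) =", info_proxy(x, z))
```

The first value comes out close to the full bit-length of `x`, the second close to zero, matching the intended reading of $I$ as shared information.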

Kolmogorov's "non-randomness" $d(G|Q,v)$ is $\|Q(\{G\})\| - K(G|v)$. We use a slice $\chi(G) =
\min_{v,\,Q=U(v)} (\|v\| + \Lambda(d(G|Q,v)))$ of his Structure Function, requiring $Q(S) = 1$ unlike [Shen 83].

The results, of course, retain validity if relativized by giving U an extra auxiliary input. 

3 The Results. 

Lemma 1 $\|\max_{x\in D} M(\{x\})\| \prec \Lambda(M(D)) + \|K(\|M(D)\|)\| + \chi(D)$.

Informal outline of the proof: We break inputs of $U$ into $\sim M(D)/d(D|Q,v)$-wide intervals $pS$.
In each interval with total $p$ we select one output $L_p = U(pp')$ and update a $Q$-test $t(X)$ $(= t_p(X))$.
Here $\ln t(X)$ accumulates $M_p(X)$, until $L = \{L_r \mid r<p\}$ intersects $X$, upon which $t(X)$ drops to 0.
$t(X)$ stops changing if its $\ln$ exceeds $\sim d(D|Q,v)$, so the restriction of $\max_p t_p(X)$ to these high
values (or 0) is lower-enumerable. $L_p$ is selected to keep the mean $Q(t) \le 1$. This is possible since
the mean choice of $L_p$ does not increase $Q(t)$, and the minimal increase cannot exceed the mean:
this is the key point of the proof. At the end, the small size of $L$ limits the complexity of its members,
and high $t(X)$ for $X \subset S \setminus L$ with $M(X) \ge M(D)$ assures $d(X|Q,v) > d(D|Q,v)$, so $X \ne D$.
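
The selection step in this outline can be simulated on a small finite example. The sketch below is only a toy model under stated assumptions: the universe of outputs, the family of candidate sets, the distribution `q`, and the per-interval distributions in `steps` are all hypothetical, each per-interval distribution is assumed to sum to 1, and `cap` plays the role of the threshold beyond which a test value stops changing.

```python
import math
import random

def greedy_tests(sets, q, steps, cap):
    """Apply t <- t * [L_p not in X] * exp(M_p(X)), choosing each L_p greedily.

    sets  : list of frozensets X (candidate sets)
    q     : list of probabilities Q({X}), summing to 1
    steps : list of dicts s -> M_p({s}), each summing to 1
    cap   : values with ceil(ln t) >= cap are frozen
    """
    t = [1.0] * len(sets)
    chosen = []
    for m_p in steps:
        def updated(s):
            new_t = []
            for X, tx in zip(sets, t):
                if tx > 0 and math.ceil(math.log(tx)) >= cap:
                    new_t.append(tx)        # frozen: stays unchanged
                elif s in X:
                    new_t.append(0.0)       # the list L now intersects X
                else:
                    a = sum(w for u, w in m_p.items() if u in X)
                    new_t.append(tx * math.exp(a))
            return new_t

        def mean(values):
            return sum(qx * v for qx, v in zip(q, values))

        # The minimal Q-mean over s cannot exceed the M_p-weighted average,
        # and that average does not exceed the current mean.
        best = min(m_p, key=lambda s: mean(updated(s)))
        t = updated(best)
        chosen.append(best)
        assert mean(t) <= 1 + 1e-9
    return chosen, t

rng = random.Random(1)
universe = range(8)
sets = [frozenset(rng.sample(universe, 3)) for _ in range(20)]
q = [1 / len(sets)] * len(sets)
steps = []
for _ in range(6):
    w = [rng.random() for _ in universe]
    steps.append({s: wi / sum(w) for s, wi in zip(universe, w)})

chosen, t = greedy_tests(sets, q, steps, cap=4)
print("chosen outputs:", chosen)
print("mean of t     :", sum(qx * v for qx, v in zip(q, t)))
```

The internal assertion checks after every step that the mean of the test stays at most 1, which is the property the outline calls the key point.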



Formal proof: Let $v$, $Q = U(v)$ minimize $\chi(D)$. Given $i$, $j$, we build inductively a list $\{L_p \in U(pS)\}$
indexed by all total $p \in \{0,1\}^{i+j}$. From $\{L_r \mid r<p\}$ we define $Q$-tests $t_p(X)$: $t_{0^{i+j}}(X) = 1$; $t_{p+1}(X) =
t_p(X)$ if $\lceil \ln t_p(X) \rceil \ge 2^j$ or $p$ is not total; else $t_{p+1}(X) = t_p(X)\,[L_p \notin X]\,\exp(M_p(X))$.

Let $t^{s}_{p+1}$ be $t_{p+1}$ for $\{L_r \mid r<p\}$ with added $L_p = s$. Using $(1-a)\exp(a) \le 1$ for $a = M_p(X)$ we get:
$\sum_s M_p(\{s\})\, t^{s}_{p+1}(X) \le t_p(X)$. So, $Q(t^{s}_{p+1}) \le Q(t_p)$ for some $s \in U(pS)$. Such choices of $L_p = s$
assure $Q(t_p) \le 1$ for all $p \in \{0,1\}^{i+j}$. Let $N = \exp(2^j - 1)$, $Q$-test $t(X) = \max_p \big(t_p(X)\,[N < t_p(X)]\big)$.

$L$ and $t(X)$ are enumerable from $v, i, j$. Take $i = \|M(D)\|$, $d = d(D|Q,v)$, $j \asymp \|d + K(i|v)\|$.
Then some $s \in L \cap D$, as otherwise $t(D) > N$ and $d(D|Q,v) > d(D|Q,(v,i,j)) - K((i,d)|v) - O(1) >
\|N\| - \|d\| - K(i|v) - O(1) > d$. And as $s \in L$, $K(s) \prec i + j + K(i,j,v) \prec i + K(i) + \|K(i)\| + \chi(D)$. ■
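
The averaging step can be written out in more detail. The following is a sketch under one assumption not stated explicitly above, namely that $M_p(S) = 1$ for total $p$; here $t^{s}$ denotes the updated test obtained when $s$ is chosen as $L_p$, $a = M_p(X)$, and $X$ is a set whose test value is not yet frozen.

```latex
\begin{align*}
\sum_{s} M_p(\{s\})\, t^{s}(X)
  &= t_p(X)\, e^{a} \sum_{s \notin X} M_p(\{s\}) \\
  &= t_p(X)\,(1-a)\,e^{a} \;\le\; t_p(X),
  \qquad \text{since } 1-a \le e^{-a}.
\end{align*}
```

For frozen $X$ the value does not depend on $s$, so the same bound holds with equality. Averaging both sides over $X$ by $Q$ gives $\sum_s M_p(\{s\})\,Q(t^{s}) \le Q(t_p)$, and a weighted average with weights summing to 1 cannot be smaller than its least term, so some $s$ has $Q(t^{s}) \le Q(t_p)$.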

Theorem 1 $\min_{x\in D} K(x) \asymp \|\max_{x\in D} M(\{x\})\| \lesssim \|M(D)\| + I(D:\mathcal{H}) \lesssim \min_{x\in D} I(D:x) + I(D:\mathcal{H})$.

Proof. Let $U(vw) = D$, $\|vw\| = K(D)$, $v$ be total, $v^-$ be not. Using $v$, $Q(G) = M_v(G)$,
we get $\chi(D) \prec \Lambda(v) + \Lambda(\|w\| - K(D|v)) \prec \Lambda(v) + \Lambda(K(\|v\|))$ since $\|v\| + \|w\| = K(D) \prec
K(D|v) + K(\|v\|) + \|v\|$. Now, $K(D|\mathcal{H}) \prec K(\|v\|) + \|w\|$, so $I(D:\mathcal{H}) \succ \|v\| - K(\|v\|) \gtrsim \chi(D)$.
Also for $x \in D$, $I(D:x) \asymp K(x) - K(x|(D,K(D))) \gtrsim \|M(D)\|$. So, Lemma 1 completes the proof. ■